knitr::opts_chunk$set(warning=FALSE,message=FALSE,error=FALSE)
My policy on handwork is that it is required for Statistics PhD students and MS in Statistics students. It is encouraged (but not required) for MS in Applied Statistics and all other students. (If you are in the latter group, you will not be penalized for getting these wrong …)
Exercises from the book: 17.1, 17.2, 18.3, 18.5
You can type your answers and submit within this file or you can do this work on paper and submit a scanned copy (be sure it is legible).
Answer:
From the plot, the local-linear regression estimator is less biased: the gap between its fitted curve and the true regression line is smaller.
Compared with kernel smoothing, the local-linear estimator also performs better at the boundaries, because a locally weighted average is pulled toward the large-valued points that lie on only one side of a boundary point, while the local-linear fit corrects for the local trend.
library(stats)
set.seed(2)
#simulated data
x<-runif(100, 0, 100)
e<-rnorm(100, 0, 20)
y<-100-5*(x/10-5)^2+(x/10-5)^3+e
f<-function(x){
y<-100-5*(x/10-5)^2+(x/10-5)^3
return(y)
}
x.seq<-seq(min(x), max(x), length.out = 100)
# true line
Ey<-100-5*(x.seq/10-5)^2+(x.seq/10-5)^3
# local linear
fit.18.ll<-loess(y~x, degree = 1, span = 0.4)
y.ll.pred<-predict(fit.18.ll, data.frame(x=x.seq))
plot(y~x)
lines(x.seq, Ey, lwd=2)
lines(x.seq, y.ll.pred, col="red", lty=2, lwd=2)
#kernel regression/kernel smoothing
lines(ksmooth(x, y, "normal", bandwidth = 18), col="green", lty=3, lwd=3)
legend(67,-50, legend = c("true regression", "local-linear", "kernel smoothing"), col = c("black", "red", "green"), lty=c(1,2,3), cex=1 )
Answer:
A span of 0.3 gives the smallest ASE.
This roughly confirms my visual selection in 18.3: as the span increases, the fitted line first fits better and then worse, which matches the pattern in the graph. However, span = 0.3 makes the fitted line a little too wiggly, so I chose span = 0.4; the ASE hardly differs between the two.
ss<-seq(0.05, 0.95, by=0.05)
n<-length(ss)
n.y<-length(y)
df.ase<-data.frame(span=ss, ASE=rep(0,n))
Ey0<-100-5*(x/10-5)^2+(x/10-5)^3
for (i in 1:n) {
sp<-ss[i]
fit<-loess(y~x, degree = 1, span = sp)
y.hat<-predict(fit)
df.ase[i,2]<-sum((y.hat-Ey0)^2)/n.y
}
with(df.ase, plot(ASE~span))
df.ase[which.min(df.ase$ASE),1]
## [1] 0.3
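The ASE above uses the true regression function, which is only available in a simulation. As a side note (my own sketch, not required by the exercise), a leave-one-out cross-validation criterion can play the same role for choosing the span on real data:

```r
# Leave-one-out CV estimate of prediction MSE for a given loess span.
# surface = "direct" lets predict() extrapolate to the held-out boundary points.
set.seed(2)
x.cv <- runif(100, 0, 100)
y.cv <- 100 - 5*(x.cv/10 - 5)^2 + (x.cv/10 - 5)^3 + rnorm(100, 0, 20)
cv.span <- function(sp, xv, yv) {
  errs <- sapply(seq_along(xv), function(i) {
    d <- data.frame(x = xv[-i], y = yv[-i])
    fit <- loess(y ~ x, data = d, degree = 1, span = sp,
                 control = loess.control(surface = "direct"))
    yv[i] - predict(fit, data.frame(x = xv[i]))  # held-out residual
  })
  mean(errs^2)
}
sapply(c(0.2, 0.3, 0.4, 0.6), cv.span, xv = x.cv, yv = y.cv)
```

The span minimizing the CV criterion can then be compared with the ASE-based choice.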
The data in ginzberg.txt (collected by Ginzberg) were analyzed by Monette (1990). The data are for a group of 82 psychiatric patients hospitalized for depression. The response variable in the data set is the patient’s score on the Beck scale, a widely used measure of depression. The explanatory variables are “simplicity” (measuring the degree to which the patient “sees the world in black and white”) and “fatalism”. (These three variables have been adjusted for other explanatory variables that can influence depression.)
Using the full quadratic regression model \(Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1^2 + \beta_4X_2^2 + \beta_5X_1X_2 + \epsilon\) regress the Beck-scale scores on simplicity and fatalism.
Answer:
From the scatterplots, there does not appear to be a quadratic relationship between depression and fatalism; there may be a nonlinear relationship between depression and simplicity, but it does not look quadratic either.
From the ANOVA results, the interaction term is needed, but the two quadratic terms do not appear necessary.
p.s. Is it possible for a model to include higher-order terms without the corresponding lower-order ones?
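As a hypothetical aside on that question (not part of the assigned solution, with made-up variable names): R will happily fit a model containing a squared term without the corresponding linear term, but the marginality principle argues against it, because such a model is not invariant to shifting the predictor (replacing x by x + c changes the model family).

```r
# Fitting a quadratic-only model is syntactically legal, just hard to defend.
set.seed(1)
xq <- runif(50, -2, 2)
yq <- 1 + 2 * xq + 0.5 * xq^2 + rnorm(50, sd = 0.3)
fit.quad.only <- lm(yq ~ I(xq^2))      # higher-order term without the lower one
fit.full      <- lm(yq ~ xq + I(xq^2)) # respects marginality
coef(fit.quad.only)  # intercept plus one coefficient
coef(fit.full)       # intercept plus two coefficients
```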
# polynomial regression
library(car)
df.g<-Ginzberg
# check the relationship intuitively
scatterplot(adjdep~adjsimp, data=df.g)
scatterplot(adjdep~adjfatal, data=df.g)
fit.0<-lm(adjdep~adjsimp+adjfatal, data = df.g)
fit.1<-lm(adjdep~adjsimp+adjfatal+adjfatal*adjsimp+I(adjsimp^2)+I(adjfatal^2), data = df.g)
fit.1.1<-lm(adjdep~adjsimp+adjfatal+adjfatal*adjsimp, data = df.g)
fit.1.2<-lm(adjdep~adjsimp+adjfatal+I(adjsimp^2), data = df.g)
fit.1.3<-lm(adjdep~adjsimp+adjfatal+I(adjfatal^2), data = df.g)
#brief(fit.1) #brief summary of fit model
summary(fit.1)
##
## Call:
## lm(formula = adjdep ~ adjsimp + adjfatal + adjfatal * adjsimp +
## I(adjsimp^2) + I(adjfatal^2), data = df.g)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63090 -0.23619 -0.06667 0.21295 1.04481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.10865 0.20687 -0.525 0.6010
## adjsimp 0.82039 0.29702 2.762 0.0072 **
## adjfatal 0.55247 0.28391 1.946 0.0554 .
## I(adjsimp^2) 0.08172 0.15374 0.532 0.5966
## I(adjfatal^2) 0.20927 0.18663 1.121 0.2657
## adjsimp:adjfatal -0.55366 0.27855 -1.988 0.0505 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3738 on 76 degrees of freedom
## Multiple R-squared: 0.4755, Adjusted R-squared: 0.441
## F-statistic: 13.78 on 5 and 76 DF, p-value: 1.416e-09
anova(fit.0, fit.1)
## Analysis of Variance Table
##
## Model 1: adjdep ~ adjsimp + adjfatal
## Model 2: adjdep ~ adjsimp + adjfatal + adjfatal * adjsimp + I(adjsimp^2) +
## I(adjfatal^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 79 11.478
## 2 76 10.621 3 0.85711 2.0444 0.1147
anova(fit.0, fit.1.1)
## Analysis of Variance Table
##
## Model 1: adjdep ~ adjsimp + adjfatal
## Model 2: adjdep ~ adjsimp + adjfatal + adjfatal * adjsimp
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 79 11.478
## 2 78 10.798 1 0.67981 4.9106 0.02961 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit.0, fit.1.2)
## Analysis of Variance Table
##
## Model 1: adjdep ~ adjsimp + adjfatal
## Model 2: adjdep ~ adjsimp + adjfatal + I(adjsimp^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 79 11.478
## 2 78 11.198 1 0.28013 1.9513 0.1664
anova(fit.0, fit.1.3)
## Analysis of Variance Table
##
## Model 1: adjdep ~ adjsimp + adjfatal
## Model 2: adjdep ~ adjsimp + adjfatal + I(adjfatal^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 79 11.478
## 2 78 11.397 1 0.081148 0.5554 0.4584
fit.2<-fit.1.1
Answer: From the plot, we can see some outlying points and one obvious high-leverage point, so there may also be some influential points.
# 3-dimensional plot; plane
library("plot3D")
x<-df.g$adjsimp
y<-df.g$adjfatal
z<-df.g$adjdep
fitpoints<-fit.2$fitted.values
grid.lines<-10
x.pred<-seq(min(df.g$adjsimp),max(df.g$adjsimp), length.out = grid.lines)
y.pred<-seq(min(df.g$adjfatal),max(df.g$adjfatal), length.out = grid.lines)
xy <- expand.grid(adjsimp = x.pred, adjfatal = y.pred) # create grids of x-y
z.pred <- matrix(predict(fit.2, newdata = xy), nrow = grid.lines, ncol = grid.lines) #should get predicted values of every grid of xy
scatter3D(x, y, z, pch = 19, cex = 1, colvar = NULL, col="red",
theta = 20, phi = 5, bty="b",
xlab = "simplicity", ylab = "fatalism", zlab = "depression",
surf = list(x = x.pred, y = y.pred, z = z.pred,
facets = TRUE, fit = fitpoints, col=ramp.col (col = c("dodgerblue3","seagreen2"), n = 300, alpha=0.9), border="black"), main = "depression regression")
Answer:
Based on Cook’s distance, points 65, 71, and 80 are influential. From the added-variable plots, point 65 is influential for all three terms, while point 80 only seems to affect the simplicity term. It is hard to tell how the regression lines would change if point 71 were removed.
influenceIndexPlot(fit.2)
influencePlot(fit.2, id.method="identify")
## StudRes Hat CookD
## 65 -0.7830169 0.61224209 0.24322281
## 70 2.7416683 0.01504210 0.02648577
## 71 3.7104658 0.09778806 0.32058125
## 80 1.7379439 0.22761790 0.21690988
avPlots(fit.2)
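As a numeric companion to these plots, a common rule of thumb (my assumption, not something stated in the text) flags observations with Cook’s distance above \(4/(n-k-1)\). A self-contained sketch on a built-in dataset (mtcars, since the Ginzberg fit lives in the chunk above):

```r
# Flag observations whose Cook's distance exceeds the 4/(n - k - 1) cutoff.
fit.demo <- lm(mpg ~ wt + hp, data = mtcars)
cd <- cooks.distance(fit.demo)
n.obs <- nrow(mtcars)
k <- length(coef(fit.demo)) - 1   # number of slopes
cutoff <- 4 / (n.obs - k - 1)     # rough rule-of-thumb threshold
which(cd > cutoff)                # candidate influential rows
```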
For this analysis, use the States.txt data, which includes average SAT scores for each state as the outcome.
Use the SAT math score (SATM) as the outcome and region, pop, percent, dollars, and pay as the explanatory variables, each included as a linear term. Interpret the findings.
Answer:
From the summary of this purely linear model, we can see that only region and percent are statistically significant. The \(R^2\) shows that the model explains a large share (around 85%, adjusted) of the variance of the response (SAT math), so judged by \(R^2\) alone it looks like a good fit. However, the component-plus-residual plots (crPlots) add more information.
The partial relationship between SAT math score and percent is almost perfectly linear, although there seems to be a gap in the data in the middle of its range. The partial relationships between the response and pop, dollars, and pay look nonlinear, i.e., they show some pattern (least obviously for pay). In other words, there could be real relationships there, and we should not simply drop these variables before further checking.
Thus, to figure out the effects of the covariates other than region and percent, the data should be refitted with nonlinear terms, e.g., nonparametric smooths.
df.s<-States
fit.lm<-lm(SATM~region+pop+percent+dollars+pay, data = df.s)
fit.lm.1<-lm(SATM~region+percent, data = df.s)
summary(fit.lm)
##
## Call:
## lm(formula = SATM ~ region + pop + percent + dollars + pay, data = df.s)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.4293 -8.4650 -0.6388 9.9749 23.5808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.158e+02 1.939e+01 26.603 < 2e-16 ***
## regionESC -2.811e+00 1.019e+01 -0.276 0.78419
## regionMA 1.215e+01 1.539e+01 0.789 0.43493
## regionMTN 1.305e+00 8.681e+00 0.150 0.88126
## regionNE 1.386e+01 1.270e+01 1.091 0.28197
## regionPAC 2.198e+00 9.715e+00 0.226 0.82219
## regionSA -1.191e+01 1.021e+01 -1.166 0.25083
## regionWNC 2.725e+01 8.931e+00 3.051 0.00414 **
## regionWSC -7.939e+00 1.011e+01 -0.785 0.43721
## pop -2.032e-04 5.037e-04 -0.403 0.68891
## percent -1.282e+00 2.099e-01 -6.109 4.04e-07 ***
## dollars 1.355e+00 4.062e+00 0.334 0.74053
## pay 4.951e-01 9.754e-01 0.508 0.61469
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.41 on 38 degrees of freedom
## Multiple R-squared: 0.8856, Adjusted R-squared: 0.8495
## F-statistic: 24.52 on 12 and 38 DF, p-value: 3.062e-14
anova(fit.lm.1, fit.lm)
## Analysis of Variance Table
##
## Model 1: SATM ~ region + percent
## Model 2: SATM ~ region + pop + percent + dollars + pay
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41 7188
## 2 38 6833 3 354.92 0.6579 0.583
crPlots(fit.lm)
Answer:
Both the general nonparametric regression model and the additive-regression model show that, among the numeric variables, percent is the only statistically significant one.
The additive-regression model (a generalized version with a parametric part, so that region can be included) gives the same result as the linear model: only region (regionWNC) and percent are statistically significant. From the plots of the smooth terms, pay and pop show essentially no relationship with SATM, while dollars shows some relationship, though with large variance.
It seems that the population, spending on public education, and teachers’ salaries are unrelated to students’ SAT math scores, or at least their effects cannot be captured by a local-polynomial or additive-regression model.
library(mgcv)
library(stats)
# general nonparametric regression model (I used to think this refers to Y^=f(x1,x2,..,xp))
# local polynomial (<= 4 variables)
fit.lp.0<-loess(SATM~pop+percent+dollars+pay, degree = 1, data = df.s)# degree: power of the highest polynomial term
#summary(fit.lp.0) #show nothing...
#find significant terms
fit.lp.1<-loess(SATM~percent+dollars+pay, degree = 1, data = df.s)
fit.lp.2<-loess(SATM~pop+dollars+pay, degree = 1, data = df.s)
fit.lp.3<-loess(SATM~pop+percent+pay, degree = 1, data = df.s)
fit.lp.4<-loess(SATM~pop+percent+dollars, degree = 1, data = df.s)
anova(fit.lp.0, fit.lp.1)
## Model 1: loess(formula = SATM ~ pop + percent + dollars + pay, data = df.s, degree = 1)
## Model 2: loess(formula = SATM ~ percent + dollars + pay, data = df.s, degree = 1)
##
## Analysis of Variance: denominator df 39.26
##
## ENP RSS F-value Pr(>F)
## [1,] 8.32 11259
## [2,] 6.55 10143 0.73051 0.5599
anova(fit.lp.0, fit.lp.2)#percent->significant
## Model 1: loess(formula = SATM ~ pop + percent + dollars + pay, data = df.s, degree = 1)
## Model 2: loess(formula = SATM ~ pop + dollars + pay, data = df.s, degree = 1)
##
## Analysis of Variance: denominator df 39.26
##
## ENP RSS F-value Pr(>F)
## [1,] 8.32 11259
## [2,] 6.93 36659 20.169 7.38e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit.lp.0, fit.lp.3)
## Model 1: loess(formula = SATM ~ pop + percent + dollars + pay, data = df.s, degree = 1)
## Model 2: loess(formula = SATM ~ pop + percent + pay, data = df.s, degree = 1)
##
## Analysis of Variance: denominator df 39.26
##
## ENP RSS F-value Pr(>F)
## [1,] 8.32 11259.4
## [2,] 7.21 9751.1 1.4131 0.2556
anova(fit.lp.0, fit.lp.4)
## Model 1: loess(formula = SATM ~ pop + percent + dollars + pay, data = df.s, degree = 1)
## Model 2: loess(formula = SATM ~ pop + percent + dollars, data = df.s, degree = 1)
##
## Analysis of Variance: denominator df 39.26
##
## ENP RSS F-value Pr(>F)
## [1,] 8.32 11259
## [2,] 6.94 10016 0.98897 0.4039
fit.lp<-loess(SATM~percent, degree = 1, data = df.s)
#plot fitted model
plot(SATM~percent, data=df.s)
ss<-with(df.s, seq(min(percent), max(percent), length.out = 200))
satm.pred<-predict(fit.lp, data.frame(percent=ss))
lines(ss, satm.pred, col="red")
lines(with(df.s, smooth.spline(percent, SATM, df=3.85)), col="green") #smooth splines
#fit.gen<-gam(SATM~s(percent), data = df.s)
# additive-regression model(semi-parametric in order to add "region")
fit.add<-gam(SATM~region+s(pop)+s(percent)+s(dollars)+s(pay), data = df.s)
summary(fit.add)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## SATM ~ region + s(pop) + s(percent) + s(dollars) + s(pay)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 492.5039 6.1674 79.856 < 2e-16 ***
## regionESC 2.8933 9.5428 0.303 0.76364
## regionMA -4.1707 14.7132 -0.283 0.77858
## regionMTN 13.7467 8.1271 1.691 0.10014
## regionNE -2.2987 12.0695 -0.190 0.85011
## regionPAC 12.6397 8.8165 1.434 0.16105
## regionSA -12.6855 8.5796 -1.479 0.14871
## regionWNC 29.5165 8.1956 3.601 0.00102 **
## regionWSC -0.3958 9.6652 -0.041 0.96758
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(pop) 1.000 1.000 0.143 0.708
## s(percent) 2.633 3.355 18.879 2.1e-08 ***
## s(dollars) 4.263 5.295 1.780 0.124
## s(pay) 1.000 1.000 0.149 0.702
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.901 Deviance explained = 93.5%
## GCV = 182.01 Scale est. = 118.15 n = 51
plot(fit.add)
Answer:
I tried a log transformation of the response, and it does not seem to improve the fit much.
Log-transforming pop seems to leave it uncorrelated with the response: its component-plus-residual plot shows a roughly random scatter around y = 0.
As for dollars and pay, neither a log transformation nor added polynomial terms improve the fit; i.e., their relationships with SAT math cannot be captured by simple parametric forms. Perhaps those relationships are too complex, or, more likely (given the original crPlots and the further results from the nonparametric models), they simply have no real relationship with SAT math.
Interpretation: compared with parametric approaches, nonparametric methods are more flexible and fit the data better, but they are harder to interpret because the smoothers are usually complicated.
Multivariate: general nonparametric models suffer from the curse of dimensionality, and additive-regression models are restricted to a separate fit for each explanatory variable, while parametric models can represent both separate and interaction effects.
Generalization: because nonparametric models rely more heavily on the data, they are prone to overfitting, i.e., they do not generalize easily to other datasets. Conversely, although parametric models cannot fit the observed data as closely because of their simpler structure, they tend to adapt to other datasets better.
p.s. Semiparametric methods seem to offer the best of both worlds.
fit.lm.tr1<-lm(log(SATM)~region+pop+percent+dollars+pay, data = df.s)
fit.lm.tr2<-lm(log(SATM)~region+pop+percent+dollars+pay, data = df.s)
fit.lm.tr3<-lm(SATM~region+poly(pop,2, raw=TRUE)+percent+poly(dollars, 5, raw = TRUE)+poly(pay, 5, raw = TRUE), data = df.s)
summary(fit.lm.tr1)
##
## Call:
## lm(formula = log(SATM) ~ region + pop + percent + dollars + pay,
## data = df.s)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.044595 -0.017204 0.000329 0.019727 0.049866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.241e+00 3.890e-02 160.455 < 2e-16 ***
## regionESC -4.775e-03 2.045e-02 -0.233 0.8166
## regionMA 2.675e-02 3.088e-02 0.866 0.3918
## regionMTN 3.987e-03 1.742e-02 0.229 0.8201
## regionNE 3.035e-02 2.548e-02 1.191 0.2410
## regionPAC 6.615e-03 1.949e-02 0.339 0.7362
## regionSA -2.410e-02 2.049e-02 -1.176 0.2469
## regionWNC 5.120e-02 1.792e-02 2.857 0.0069 **
## regionWSC -1.463e-02 2.029e-02 -0.721 0.4751
## pop -4.105e-07 1.011e-06 -0.406 0.6869
## percent -2.625e-03 4.211e-04 -6.234 2.72e-07 ***
## dollars 2.337e-03 8.149e-03 0.287 0.7758
## pay 1.189e-03 1.957e-03 0.607 0.5472
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0269 on 38 degrees of freedom
## Multiple R-squared: 0.8852, Adjusted R-squared: 0.8489
## F-statistic: 24.41 on 12 and 38 DF, p-value: 3.295e-14
summary(fit.lm.tr3)
##
## Call:
## lm(formula = SATM ~ region + poly(pop, 2, raw = TRUE) + percent +
## poly(dollars, 5, raw = TRUE) + poly(pay, 5, raw = TRUE),
## data = df.s)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.3692 -8.4480 -0.4978 8.4389 22.5876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.989e+03 1.865e+04 -0.482 0.633451
## regionESC 2.182e+01 1.469e+01 1.485 0.148340
## regionMA 6.983e+00 1.817e+01 0.384 0.703484
## regionMTN 1.268e+01 1.164e+01 1.089 0.284953
## regionNE 1.272e+01 1.629e+01 0.781 0.441098
## regionPAC 2.338e+00 1.298e+01 0.180 0.858340
## regionSA -9.261e+00 1.210e+01 -0.766 0.450072
## regionWNC 3.920e+01 1.109e+01 3.535 0.001390 **
## regionWSC 8.404e+00 1.315e+01 0.639 0.527892
## poly(pop, 2, raw = TRUE)1 -4.518e-05 1.631e-03 -0.028 0.978092
## poly(pop, 2, raw = TRUE)2 -2.132e-09 5.973e-08 -0.036 0.971767
## percent -1.117e+00 2.539e-01 -4.397 0.000135 ***
## poly(dollars, 5, raw = TRUE)1 -1.782e+03 1.491e+03 -1.195 0.241798
## poly(dollars, 5, raw = TRUE)2 6.141e+02 5.494e+02 1.118 0.272872
## poly(dollars, 5, raw = TRUE)3 -1.003e+02 9.777e+01 -1.026 0.313440
## poly(dollars, 5, raw = TRUE)4 7.811e+00 8.418e+00 0.928 0.361150
## poly(dollars, 5, raw = TRUE)5 -2.333e-01 2.813e-01 -0.829 0.413755
## poly(pay, 5, raw = TRUE)1 1.950e+03 3.071e+03 0.635 0.530411
## poly(pay, 5, raw = TRUE)2 -1.297e+02 1.980e+02 -0.655 0.517555
## poly(pay, 5, raw = TRUE)3 4.222e+00 6.300e+00 0.670 0.508093
## poly(pay, 5, raw = TRUE)4 -6.740e-02 9.900e-02 -0.681 0.501428
## poly(pay, 5, raw = TRUE)5 4.229e-04 6.146e-04 0.688 0.496849
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.65 on 29 degrees of freedom
## Multiple R-squared: 0.9095, Adjusted R-squared: 0.844
## F-statistic: 13.88 on 21 and 29 DF, p-value: 5.242e-10
crPlots(fit.lm.tr3)
Return to the Chile.txt dataset used in HW2. Reanalyze the data employing generalized nonparametric regression (including generalized additive) models.
Answer:
From the several ANOVA results, we can see that age, statusquo, and education are the three statistically significant variables. statusquo is the only one with a nonlinear relationship with the response vote (roughly quadratic in the middle of its range, though the variance is large near the boundaries).
library(arm)
library(dplyr)
df.c<-Chile %>%
na.omit()
fit.gam.0<-gam(vote~region+as.factor(population)+sex+s(age)+education+income+s(statusquo), family = binomial, data=df.c) #age, statusquo, education are significant
fit.gam.1<-gam(vote~s(age)+education+s(statusquo), family = binomial, data=df.c)
anova(fit.gam.1, fit.gam.0, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: vote ~ s(age) + education + s(statusquo)
## Model 2: vote ~ region + as.factor(population) + sex + s(age) + education +
## income + s(statusquo)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2422.5 1186.1
## 2 2408.2 1168.8 14.351 17.344 0.2589
fit.gam.2<-gam(vote~age+education+s(statusquo), family = binomial, data=df.c)
anova(fit.gam.2, fit.gam.1, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: vote ~ age + education + s(statusquo)
## Model 2: vote ~ s(age) + education + s(statusquo)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2423.6 1187.6
## 2 2422.5 1186.1 1.0579 1.4911 0.2368
fit.gam.3<-gam(vote~s(age)+education+statusquo, family = binomial, data=df.c)
anova(fit.gam.3, fit.gam.1, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: vote ~ s(age) + education + statusquo
## Model 2: vote ~ s(age) + education + s(statusquo)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2424.7 1233.6
## 2 2422.5 1186.1 2.1675 47.505 6.601e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.gam<-fit.gam.2
summary(fit.gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## vote ~ age + education + s(statusquo)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.334742 0.302930 7.707 1.29e-14 ***
## age 0.017437 0.006382 2.732 0.00629 **
## educationPS -0.211363 0.250555 -0.844 0.39890
## educationS -0.486749 0.193130 -2.520 0.01173 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(statusquo) 2.759 3.435 41.2 2.01e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0293 Deviance explained = 6.36%
## UBRE = -0.50591 Scale est. = 1 n = 2431
plot(fit.gam)
Answer:
Yes, by adding higher-order terms in statusquo. With a third-degree polynomial, the ANOVA shows that this parametric model performs as well as the nonparametric model (its residual deviance is even slightly lower), so we have successfully captured the nonlinearity of the relationship.
plot(vote~statusquo, data = df.c)
fit.glm<-gam(vote~age+education+poly(statusquo, 3, raw = TRUE), family = binomial, data=df.c)
anova(fit.glm, fit.gam, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: vote ~ age + education + poly(statusquo, 3, raw = TRUE)
## Model 2: vote ~ age + education + s(statusquo)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 2424.0 1186.5
## 2 2423.6 1187.6 0.43452 -1.082
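A self-contained simulated check of the same idea (not the Chile data; all names below are made up for illustration): when the true logit is a cubic, a degree-3 polynomial logistic fit should be competitive with a gam smooth, and AIC gives a quick comparison even though the models are not nested.

```r
# Compare a cubic logistic regression with a gam smooth on data whose
# true logit really is cubic; lower AIC is preferred on that criterion.
library(mgcv)
set.seed(3)
x.sim <- runif(400, -2, 2)
p.sim <- plogis(1 + x.sim - x.sim^3)   # true cubic logit
y.sim <- rbinom(400, 1, p.sim)
fit.param  <- glm(y.sim ~ poly(x.sim, 3), family = binomial)
fit.smooth <- gam(y.sim ~ s(x.sim), family = binomial)
AIC(fit.param, fit.smooth)
```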
For this analysis, use the Duncan.txt data. Here we are interested in the outcome prestige and the explanatory variable income.
library(car)
df.d<-Duncan
fit.ll<-loess(prestige ~ income, data = df.d, span = 0.6, degree = 1)
summary(fit.ll)
## Call:
## loess(formula = prestige ~ income, data = df.d, span = 0.6, degree = 1)
##
## Number of Observations: 45
## Equivalent Number of Parameters: 3.12
## Residual Standard Error: 17.53
## Trace of smoother matrix: 3.61 (exact)
##
## Control settings:
## span : 0.6
## degree : 1
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE
## drop.square: FALSE
Answer:
From the plot, the two fitted curves are similar until income exceeds roughly 60; the fourth-order polynomial regression is more strongly curved at the right boundary.
fit.p4<-lm(prestige ~ poly(income, 4, raw = TRUE), data = df.d)
inc.200<-with(df.d, seq(min(income), max(income), length.out = 200))
pred.ll<-predict(fit.ll, data.frame(income=inc.200))
pred.p4<-predict(fit.p4, data.frame(income=inc.200))
#anova(fit.ll, fit.p4, test="Chisq")
plot(prestige~income, data = df.d)
lines(inc.200, pred.ll, lty=1, lwd=2, col="red")
lines(inc.200, pred.p4, lty=2, lwd=2, col="green")
legend(5, 98, legend=c("local-linear regression", "4th order polynomial"),
col=c("red","green"), lty=1:2, cex=1)